Method of synthesizing a block of a speech signal in a CELP-type coder

ABSTRACT

A new scheme to generate the stochastic excitation for a CELP-type speech codec is described, based upon a hybrid stochastic codebook search technique that includes the use of regular pulse excitation (RPE) codebooks. From the ideal RPE sequence, the position of the first nonzero pulse, the position of the pulse with maximum magnitude and the overall sign of the RPE sequence are determined. The corresponding target vectors and pulse responses of the synthesis filter are stored in databases, one database for each position of the maximum pulse. These databases are used to derive the stochastic codebook via the so-called LBG-algorithm. Once the codebook has become available, the position of the maximum pulse serves as a pre-selection measure to limit the search for the "best" candidate vector to a "small" subset of the stochastic codebook.

DESCRIPTION

This invention relates to speech coding, particularly to a method of synthesizing a block of a speech signal in a CELP-type (Code Excited Linear Predictive) coder, the method comprising the steps of applying an excitation vector to a synthesizer filter of the coder, said excitation vector consisting of two gain normalized components derived, on the one hand, from an adaptive codebook and, on the other hand, from a stochastic codebook.

Efficient speech coding methods are continuously being developed. The principles of Code Excited Linear Prediction (CELP) are described in an article by M. R. Schroeder and B. S. Atal: "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing--ICASSP, Volume 3, pp 937-940, March 1985. The basic structure of the CELP-type speech coders developed to date is quite similar. An LPC synthesis filter (LPC=Linear Predictive Coding) is excited by so-called "adaptive" and "stochastic" excitations. Each speech excitation vector is scaled by its respective gain, and the gains are often jointly optimized.

The CELP approach offers good speech quality at low bit rates. However, degradations of speech quality can be heard if the synthesized speech is compared with the original (band limited) speech, especially at bit rates below 16 kb/sec. One reason is the need to restrict the computational requirements of the search for the "best" excitation to reasonable values in order to make the algorithm practical. Therefore, many CELP-type coders use simplified structures for the codebooks, as already indirectly suggested by Schroeder/Atal in the said basic article. Such methods cause some degradation in speech quality. It is known that the speech quality is strongly related to the "quality" of the stochastic codebook(s) which give(s) the innovation sequence for the speech signal to be synthesized. Although it is possible to implement very good full search codebooks at reasonable data rates, it is still impossible to implement a full search in real time on existing digital signal processors. To overcome this problem, a reasonable approach is a pre-selection of a relatively small number of "good" code vector candidates, so that the codebook search can be done in real time and the speech quality is retained.

So-called trained codebooks can have advantages over algebraic codebooks in terms of speech quality. Nevertheless, many of today's CELP-type speech coders employ algebraic codebooks to provide the stochastic excitation in order to reduce complexity and memory requirements.

FIG. 1 shows the typical structure of an "analysis-by-synthesis-loop" of a CELP-type speech codec. A common scheme is that the synthesis filter, i.e. blocks 1 and 2, providing the spectral envelope of the speech signal to be coded is excited with two different excitation parts. One of them is called "adaptive excitation". The other excitation part is called "stochastic excitation". The first excitation part is taken from a buffer where old excitation samples of the synthesis filter are stored. Its task is to insert the harmonic structure of speech. The second excitation part is a so-called stochastic excitation which rebuilds the noisy components of the signal. Both excitation parts are taken from "codebooks", i.e. from an adaptive codebook 3 and from a stochastic codebook 4. The adaptive codebook 3 is time variant and updated each time a new excitation of the synthesis filter has been found. The stochastic codebook 4 is fixed. A synthetic speech signal is generated already in the speech encoder by a process called "analysis-by-synthesis". Codebooks 3, 4 are searched for the vectors whose scaled and filtered versions (gains g1, g2) give the "best" approximation of the signal to be transmitted as "reconstructed target vector". The "best" excitation vectors are chosen according to an error measure (block 5) which is computed from the perceptually weighted error vector in block 6.

In theory, the approximation of the target vector can be performed quite well in terms of perception even at relatively low bit rates. In practice, however, there are limitations, namely, as already mentioned, the time required to perform the codebook search and the memory needed to store the codebooks. Therefore, only suboptimal search procedures can be applied to keep the complexity low. The codebooks 3, 4 are searched for the "best" code vector sequentially, and each single codebook search is also performed suboptimally to some extent. These limitations can cause a perceptible decrease in speech quality. Therefore, a lot of work has been done in the past to find the excitation with reasonable effort while retaining high speech quality. One approach for simplifying the search procedures is described in EP-A-0 515 138.

Typically, CELP coders are driven by the stochastic excitation, since the adaptive codebook 3 only depends on vectors previously chosen from the stochastic codebook 4. For this reason, the content of the stochastic codebook 4 is not only important for rebuilding noisy components of speech but also for the reproduction of the harmonic parts. Therefore, most CELP-type coders mainly differ in the stochastic excitation part. The other parts are often quite similar.

As already mentioned, there are two different stochastic codebook approaches, i.e. trained codebooks and algebraic codebooks. Trained codebooks often have candidate vectors with all samples being nonzero and different in amplitude and sign. In contrast, algebraic codebooks usually have only a few nonzero samples, and often the amplitudes of all nonzero samples are set to one. A full search in a trained codebook is more complex than a full search in an algebraic codebook of the same size. In addition, no memory is required to store an algebraic codebook, since the candidate vectors can be constructed online while the codebook search is performed. Therefore, an algebraic codebook seems to be the better choice. However, to ensure good reproduction of speech, a "large" number of different codevector candidates including speech characteristics is needed. In this respect, trained codebooks have advantages over algebraic ones. On the other hand, the "best" candidate vector should be found with "small" effort. These are conflicting requirements.

SUMMARY OF THE INVENTION

It is an object of the invention to make trained codebooks applicable by a new process of preselecting a reasonable number of candidate codevectors in order to limit the "closed-loop" search for the best codevector to a "small" subset of candidate codevectors.

It is a further object of the invention to perform such preselection with limited effort, such that the subsequent codebook search applied to the preselected candidate vectors is clearly less complex than a full search in the codebook.

As a first approach to the invention, such a preselection measure is derived from an "ideal" RPE sequence (RPE=Regular Pulse Excitation).

According to the invention, a method for synthesizing a block of a speech signal in a CELP-type coder comprises the step of applying an excitation vector to a synthesizing filter of the coder, said excitation vector consisting of two gain normalized components derived, on the one hand, from an adaptive codebook and, on the other hand, from a stochastic codebook, said method being characterized in that, for limiting the computational effort of the stochastic codebook components search, an ideal regular pulse excitation sequence is computed from a target vector derived from a weighted speech sample signal and the impulse response of the synthesis filter, followed by determination of four parameters therefrom, namely

the position of the first nonzero pulse of the ideal RPE excitation sequence,

the position of the maximum pulse within said RPE excitation sequence,

the overall sign of the RPE sequence defined as the respective sign of said maximum pulse, and

the position of the corresponding part of the pulse codebook, as the position of the maximum pulse,

said four parameters being transmitted to the speech decoder.

The starting point of the invention is the Regular Pulse Excitation (RPE), which has been known in principle since the early eighties. The invention, however, takes specific advantage of this approach.

In the following, the computation of an ideal RPE is briefly described. For more details, specific reference is made to a paper by Peter Kroon: "Time-domain coding of (near) toll quality speech at rates below 16 kb/s", Delft University of Technology, March 1985.

The Regular Pulse Excitation (RPE)

Assume the excitation vector to be N samples long. In general, each of those samples has a different sign and amplitude. In practice, it is necessary to limit the number of codevectors and/or to reduce the number of nonzero pulses in the excitation vector in order to make the codebook search possible with today's signal processors. One possibility to reduce the number of nonzero pulses is to employ RPE. RPE means that the spacing between adjacent nonzero pulses is constant. If, for example, every second excitation pulse has nonzero amplitude, there are two possibilities to place N/2 nonzero pulses in a vector of length N: either the first, third, fifth, . . . pulse is nonzero, or the second, fourth, sixth, . . . pulse is nonzero. If the number of nonzero pulses is L, L<=N, every (N/L)-th pulse is nonzero and there are (N-(N/L)*(L-1)) possibilities to place L nonzero pulses as an RPE sequence in a vector of length N (both divisions are integer divisions). That means the first nonzero pulse can be located at (N-(N/L)*(L-1)) different positions. The best set of pulse amplitudes for those different possibilities can be computed in a straightforward manner. The following variables are defined:

    ______________________________________
    p      target vector to rebuild, (1*N)-Matrix
    h      impulse response of synthesis filter, (1*N)-Matrix
    H      impulse response matrix, (N*N)-Matrix
    M      matrix which gives the contribution of the nonzero pulses
           in excitation vector, (L*N)-Matrix
    b      vector containing L nonzero pulse amplitudes and signs,
           (1*L)-Matrix
    c      excitation vector, (1*N)-Matrix
    c'     filtered excitation vector, (1*N)-Matrix
    e      difference vector between target vector and filtered
           codevector (error vector)
    E      error measure
    ______________________________________
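The placement count stated before the variable table can be checked with a few lines of Python. This is only an illustrative sketch: the function name `rpe_grids` and the 0-based sample indices are assumptions made here, whereas the text counts pulses from one.

```python
def rpe_grids(n, l):
    """List every regular pulse grid for l nonzero pulses in a vector of length n.

    The spacing is n // l (integer division) and the first pulse may sit at
    n - (n // l) * (l - 1) different offsets, as stated in the text.
    Positions are 0-based, which is an illustrative choice.
    """
    if not 1 <= l <= n:
        raise ValueError("need 1 <= l <= n")
    spacing = n // l
    n_offsets = n - spacing * (l - 1)
    return [[offset + k * spacing for k in range(l)] for offset in range(n_offsets)]


if __name__ == "__main__":
    for grid in rpe_grids(8, 4):
        print(grid)
```

For N=8 and L=4 this yields the two grids of the "every second pulse" example; for N=10 and L=3 it yields four grids, the last one ending at the final sample of the vector.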

The excitation vector is given by

    c=b·M,

the matrix product of vector b and matrix M. Its filtered version is

    c'=b·M·H.

The error to be minimized is the difference between the target vector and this signal.

    e=p-c'

The error measure is the simple Euclidean distance measure.

    E=e·e^{T}

Replacing e by the above given equations, we obtain

    E=p·p^{T} -2·p·H^{T}·M^{T}·b^{T} +b·M·H·H^{T}·M^{T}·b^{T}.

Setting the partial derivative of E with respect to b to zero, ∂E/∂b=0, leads to the "best" set of amplitudes and signs, which are computed by

    b=p·H^{T}·M^{T}·(M·H·H^{T}·M^{T})^{-1}.
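The step from the error measure to the optimal amplitudes can be written out as follows. This is a sketch of the standard least-squares argument, using the row-vector conventions of the variable table and assuming that M·H·H^T·M^T is invertible.

```latex
\begin{aligned}
E &= (p - bMH)\,(p - bMH)^{T} \\
  &= p\,p^{T} - 2\,p\,H^{T}M^{T}b^{T} + b\,MHH^{T}M^{T}\,b^{T}, \\
\frac{\partial E}{\partial b} &= -2\,p\,H^{T}M^{T} + 2\,b\,MHH^{T}M^{T} = 0, \\
b &= p\,H^{T}M^{T}\left(MHH^{T}M^{T}\right)^{-1}.
\end{aligned}
```

The two cross terms combine because each is a scalar and equals its own transpose. Since M·H·H^T·M^T is symmetric, the same result in column form reads b^T = (M·H·H^T·M^T)^{-1}·M·H·p^T.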

The impulse response matrix H looks like

    ______________________________________
            h(0)   h(1)   h(2)   h(3)   ..   h(N-1)
            0      h(0)   h(1)   h(2)   ..   h(N-2)
    H =     0      0      h(0)   h(1)   ..   h(N-3)
            0      0      0      h(0)   ..   h(N-4)
            ..     ..     ..     ..     ..   ..
            0      0      0      0      0    h(0)
    ______________________________________

If, for example, L=N/2, M is structured as shown below for the first and second possibility to place pulses, respectively.

    ______________________________________
                 1    0    0    0    0    0    0    ..   ..   0
                 0    0    1    0    0    0    0    ..   ..   0
    M^{(1)} =    0    0    0    0    1    0    0    ..   ..   0
                 ..   ..   ..   ..   ..   ..   ..   ..   ..   ..
                 0    0    0    0    0    0    0    ..   1    0

                 0    1    0    0    0    0    0    ..   ..   0
                 0    0    0    1    0    0    0    ..   ..   0
    M^{(2)} =    0    0    0    0    0    1    0    ..   ..   0
                 ..   ..   ..   ..   ..   ..   ..   ..   ..   ..
                 0    0    0    0    0    0    0    ..   ..   1
    ______________________________________

In general, each row of M has just a single element equal to 1; the other elements are zero. The n-th row gives the position of the n-th pulse. If there are m possibilities to place L pulses as an RPE sequence, there are m different versions of the matrix M. With m different matrices M, there are also m different sets of amplitudes. The set which provides the smallest error E is denoted as the "ideal" RPE sequence.
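The selection of the "ideal" RPE sequence can be sketched in plain Python. The sketch below is an illustration under stated assumptions, not the patented implementation: the names `solve` and `ideal_rpe` are hypothetical, indices are 0-based, h(0) is assumed to be nonzero so that the normal equations are solvable, and the product M·H is formed directly as delayed copies of the impulse response instead of multiplying full matrices.

```python
def solve(g, y):
    """Solve g x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    a = [row[:] + [y[i]] for i, row in enumerate(g)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = a[i][n] - sum(a[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def ideal_rpe(p, h, l):
    """Return (error, first_pulse_position, amplitudes) of the best RPE grid.

    p: target vector, h: impulse response (at least len(p) samples),
    l: number of nonzero pulses.
    """
    n = len(p)
    spacing = n // l
    best = None
    for offset in range(n - spacing * (l - 1)):
        positions = [offset + k * spacing for k in range(l)]
        # Row k of M*H is the impulse response delayed by positions[k] samples.
        mh = [[h[j - q] if j >= q else 0.0 for j in range(n)] for q in positions]
        # Normal equations: (M H H^T M^T) b^T = M H p^T.
        gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in mh] for r1 in mh]
        rhs = [sum(x * y for x, y in zip(p, r)) for r in mh]
        b = solve(gram, rhs)
        synth = [sum(b[k] * mh[k][j] for k in range(l)) for j in range(n)]
        err = sum((x - y) ** 2 for x, y in zip(p, synth))
        if best is None or err < best[0]:
            best = (err, offset, b)
    return best
```

One least-squares problem of size L is solved per admissible grid, and the grid with the smallest E wins. A production codec would more likely exploit the structure of M·H·H^T·M^T than run a generic elimination.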

The method applied here may be called "hybrid" since the preselection of codevectors to be tested in the "analysis-by-synthesis-loop" is done outside of said loop. The part of the codebook to which the loop search is applied is determined before the analysis-by-synthesis-loop is entered.

BRIEF DESCRIPTION OF THE DRAWING

The new synthesizing method according to the invention and advantageous examples thereof are described in detail in the following with reference to the drawings, in which

FIG. 1 shows a speech analysis-by-synthesis-loop already explained above;

FIGS. 2(a) and 2(b) serve to explain a stochastic pulse codebook in its relation to an excitation generator;

FIG. 3 gives an example for L=N/2 pulses in an ideal RPE sequence in accordance with the invention;

FIG. 4 explains the functioning of an excitation generator;

FIG. 5 depicts an example of a speech encoder as used for performing the speech synthesizing method according to the invention; and

FIGS. 6(a) and 6(b) show, for completeness of the description, an example of the speech decoder as used in connection with the speech encoder of FIG. 5.

DETAILED DESCRIPTION OF THE INVENTION

At first, the RPE based preselection of a stochastic codebook part and the derivation of the pulse codebook are described with reference to FIGS. 2(a), 2(b), 3 and 4.

The maximum pulse position of an "ideal" RPE sequence is used as a preselection measure to limit the closed loop codebook search to a "small" number of candidate vectors.

Assume the codebook structure given in FIG. 2(a) to be available. There is a pulse codebook having L parts (L=number of nonzero samples). Codebook part i (i=1,2, . . . ,L) consists of M_i vectors of L samples. These vectors are candidate vectors for the nonzero pulses of an RPE sequence. In every vector of the n-th part, the n-th sample has maximum magnitude. The L parts are joined together to form one codebook.

FIG. 2(b) shows, using codebook part 2 as an example, how the preselection procedure works and how a code vector is constructed. The "ideal" RPE sequence is computed as outlined in keywords in FIG. 2(a) and FIG. 2(b). The position of the first nonzero pulse, the maximum pulse position and the overall sign are taken from the "ideal" RPE. If the maximum pulse is negative, the overall sign is negative. Otherwise the overall sign is positive. The overall sign is required since the pulse codebook 4a contains only codevectors with positive maximum pulse.

FIG. 3 shows the derivation of the "position of a first nonzero pulse", the "maximum pulse position" and the "overall sign" from an example RPE sequence. FIG. 4 gives an example of how the excitation generator 14 of FIG. 2(b) works. If the ideal RPE's maximum pulse is negative, all pulses of the pulse vector to be tested are multiplied by -1. If the n-th nonzero sample of the ideal RPE sequence has maximum magnitude, the n-th part of the pulse codebook is searched for the best candidate vector. That means that, as a significant advantage of the invention, the codebook search is applied to just (100/L)% of all candidate vectors.
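The derivation of the preselection parameters and the behaviour of the excitation generator can be sketched as below. The function names `rpe_parameters` and `build_excitation`, the 0-based indices and the tie-breaking rule (first maximum wins) are assumptions made for illustration; the position of the first nonzero pulse is simply the grid offset found when the ideal RPE sequence was computed.

```python
def rpe_parameters(pulses):
    """Return (maximum pulse index, overall sign) for the L ideal RPE amplitudes."""
    max_index = max(range(len(pulses)), key=lambda k: abs(pulses[k]))
    sign = -1 if pulses[max_index] < 0 else 1
    return max_index, sign


def build_excitation(pulse_vector, first_pos, sign, n):
    """Spread a codebook pulse vector over a length-n excitation vector.

    Pulses are placed on the regular grid starting at first_pos, and every
    pulse is multiplied by the overall sign, because the pulse codebook only
    holds vectors with a positive maximum pulse.
    """
    l = len(pulse_vector)
    spacing = n // l
    if first_pos + (l - 1) * spacing >= n:
        raise ValueError("grid does not fit into the excitation vector")
    excitation = [0.0] * n
    for k, amplitude in enumerate(pulse_vector):
        excitation[first_pos + k * spacing] = sign * amplitude
    return excitation
```

With L=4, for example, only one of four codebook parts is searched, which corresponds to the 25% figure implied by the (100/L)% statement when all parts have the same size.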

As a result, the following parameters are transmitted to the speech decoder:

position of the first nonzero pulse,

position of the maximum pulse (=codebook part to which closed-loop search is applied),

overall sign,

position in corresponding part of the pulse codebook.

The speech codec in which the above described scheme is to be introduced is run with a sufficient set of training speech data in order to derive the pulse codebook described before. To generate the stochastic excitation during the training process, the following is done:

The ideal RPE sequence is computed from the target vector to be rebuilt and the impulse response of the synthesis filter. The position of the first nonzero pulse, the maximum pulse position and the overall sign are taken from the ideal RPE as given above.

If the n-th nonzero sample of the ideal RPE sequence has maximum magnitude, the normalized RPE sequence is stored in the n-th database. The normalization is performed in two steps. In the first step, the RPE sequence is normalized such that the maximum pulse has a positive value. In the second step, the sequence obtained after the first step is divided by the energy of the target vector to which the RPE sequence belongs. This is done to remove the influence of the loudness of the signal from the codebook entries. In this way, L databases are obtained. The databases contain "normalized waveforms". Therefore, the codebooks trained on the databases also contain "normalized waveforms".
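The two normalization steps and the sorting into databases can be sketched as follows. The names `normalize_rpe` and `add_to_databases` are hypothetical, and "energy" is taken here to mean the sum of squared target samples, which is an assumption since the text does not define it further.

```python
def normalize_rpe(pulses, target):
    """Return (database index, normalized waveform) for one training vector."""
    max_index = max(range(len(pulses)), key=lambda k: abs(pulses[k]))
    # Step 1: flip the sequence so that the maximum pulse becomes positive.
    sign = -1.0 if pulses[max_index] < 0 else 1.0
    # Step 2: divide by the target energy (assumed: sum of squares).
    energy = sum(x * x for x in target)
    if energy == 0.0:
        raise ValueError("target vector has zero energy")
    return max_index, [sign * x / energy for x in pulses]


def add_to_databases(databases, pulses, target):
    """Store the normalized RPE sequence in the database of its maximum pulse."""
    index, waveform = normalize_rpe(pulses, target)
    databases[index].append(waveform)
```

A training run would start from `databases = [[] for _ in range(L)]` and call `add_to_databases` once per subframe.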

For each database, codebook training is performed separately according to the LBG-algorithm. (For details see the description in Y. Linde, A. Buzo, R. M. Gray: "An Algorithm for Vector Quantizer Design", IEEE Transactions on Communications, January 1980).
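For orientation, a strongly simplified sketch of LBG-style training with codebook splitting is given below. It is not the procedure of the cited paper in every detail: a fixed number of refinement passes replaces the usual distortion-based stopping rule, the split uses a small additive perturbation, the codebook size must be a power of two, and the names `lbg`, `sq_dist` and `centroid` are chosen here for illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lbg(training, size, eps=0.01, passes=20):
    """Train a codebook with `size` entries (a power of two) by splitting."""
    if size < 1 or size & (size - 1):
        raise ValueError("size must be a power of two")
    codebook = [centroid(training)]
    while len(codebook) < size:
        # Split every code vector into two slightly perturbed copies.
        codebook = [[x + s * eps for x in c] for c in codebook for s in (1, -1)]
        for _ in range(passes):
            cells = [[] for _ in codebook]
            for v in training:
                nearest = min(range(len(codebook)),
                              key=lambda i: sq_dist(v, codebook[i]))
                cells[nearest].append(v)
            # Empty cells keep their old code vector.
            codebook = [centroid(cell) if cell else old
                        for cell, old in zip(cells, codebook)]
    return codebook
```

In the scheme described here, such a routine would be run once per database, so that each of the L codebook parts is trained only on waveforms whose maximum pulse sits at the same position.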

Finally, the different codebooks are joined together such that the n-th part of the overall codebook contains candidate vectors in which the n-th sample has maximum magnitude.

An example of the speech codec which employs the new stochastic codebook scheme is described below with reference to FIG. 5. Note that the block diagram or scheme does not depend on this codec. It can also be used with other CELP-type speech codecs.

The synthesis filter shown in FIG. 5 gives the spectral envelope of the signal. Another interpretation is that the short term correlation of the signal is given by this filter. This filter is excited by vectors taken from codebooks which contain a reasonably large number of candidate vectors. One vector is taken from the adaptive codebook 3 where old excitation vectors are stored. This excitation part rebuilds the harmonic structure of speech (or the long term correlation of the speech signal) and is called the "adaptive excitation". The second part of the excitation is taken from the stochastic codebook 4. This codebook introduces the noisy parts of the synthesized speech signal or the innovation of the signal which cannot be provided by linear prediction.

With reference to FIG. 5, the computations are divided into frame and subframe processing. A speech frame consists of N_frame speech samples. The codec delay is N_frame times the sample period. Each frame has k subframes of length N_frame/k samples. Parameters which are computed once per frame are called "frame parameters". Parameters which are computed for each subframe are called "subframe parameters". First, the frame parameters are computed. These parameters are

LPC's (Linear Predictive Coefficients) derived via blocks 21, 22, 23, 24, 25 and 28 (explained later) and

loudness derived via blocks 21, 26, 27 and 28 (explained later).
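The frame and subframe arithmetic described above is easy to make concrete. The numbers used in the example call (160 samples per frame, 4 subframes, 8000 Hz sampling rate) are hypothetical values chosen only for illustration and are not taken from the text; the function name `frame_layout` is likewise an assumption.

```python
def frame_layout(n_frame, k, sample_rate_hz):
    """Return the subframe length in samples and the codec delay in seconds."""
    if n_frame % k:
        raise ValueError("n_frame must be divisible by k")
    return {
        "subframe_length": n_frame // k,
        # Codec delay = N_frame times the sample period (1 / sample rate).
        "delay_s": n_frame / sample_rate_hz,
    }


if __name__ == "__main__":
    print(frame_layout(160, 4, 8000))
```

With these assumed values each subframe has 40 samples and the codec delay is 20 ms.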

The LPC's out of block 28 describe the spectral envelope and the loudness value gives the loudness of the signal in the current speech frame. Then, the excitation of this synthesis filter is calculated for each subframe. The excitation is described by the subframe parameters

position in adaptive codebook 3,

position in pulse codebook 4a,

maximum pulse position in block 15,

first nonzero pulse position in block 15,

overall sign in block 15, and

position in gain codebook 16.

These parameters are transmitted to the decoder (see FIG. 6b).

Before entering the LPC-analysis stage, a current speech frame is windowed in block 21. LPC-analysis 22 is performed via LEVINSON-DURBIN recursion. The LPC's are transformed into LSF's (Line Spectrum Frequencies) in block 23 and vector-quantized in block 24. For further use in the encoder, the quantized LSF's are converted into quantized LPC's in block 25. The LPC's are interpolated with the LPC's of the previous speech frame in block 28. A loudness value is computed from the windowed speech frame in block 26, quantized in block 27 and interpolated with the loudness value of the previous frame in block 28.
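The LEVINSON-DURBIN recursion mentioned for block 22 can be sketched in its textbook form. The sketch assumes the autocorrelation method and the sign convention A(z) = 1 + a(1)·z^-1 + ... + a(order)·z^-order for the prediction error filter; the window, the predictor order and any bandwidth expansion used by the actual codec are not given in the text and are not modelled. The function names are chosen here for illustration.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r(0)..r(order) of a (windowed) frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Return (a, err): coefficients a[0..order] with a[0] = 1 and the
    final prediction error energy."""
    if r[0] <= 0.0:
        raise ValueError("r[0] must be positive")
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err
```

For the autocorrelation sequence 1, 0.5, 0.25 of a first-order process, the recursion returns a(1) = -0.5 and a(2) = 0, as expected.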

Each speech subframe is weighted in block 20 to enhance the perceptual speech quality. From the weighted speech subframe, the zero input response of the synthesis filter 1 is subtracted in a first subtractor 29. The resulting signal is called the "target vector". This target vector has to be rebuilt by the "analysis-by-synthesis-loop". The following computations are done for each subframe.

First, the adaptive excitation is taken from the adaptive codebook 3. It is scaled by the optimal gain g1 and subtracted from the target vector in a second subtractor 30. The remaining signal is to be rebuilt by the stochastic excitation. In accordance with the invention, the ideal RPE sequence is computed from the remaining signal to be rebuilt and the impulse response of the synthesis filter. The position of the first nonzero pulse, the maximum pulse position and the overall sign are taken from the ideal RPE as described above.
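The gain scaling and subtraction of the adaptive contribution can be illustrated as follows. The sketch assumes that "optimal" means least-squares optimal and that the vector compared with the target is the adaptive codebook vector after synthesis filtering; both points are assumptions, as are the function names, and the gain is left unquantized.

```python
def optimal_gain(target, filtered):
    """Least-squares gain g minimizing the energy of target - g * filtered."""
    energy = sum(y * y for y in filtered)
    if energy == 0.0:
        return 0.0
    return sum(t * y for t, y in zip(target, filtered)) / energy


def remove_adaptive_part(target, filtered_adaptive):
    """Return (g1, remaining signal) after subtracting the adaptive contribution."""
    g1 = optimal_gain(target, filtered_adaptive)
    remaining = [t - g1 * y for t, y in zip(target, filtered_adaptive)]
    return g1, remaining
```

With this choice of gain, the remaining signal is orthogonal to the filtered adaptive vector, so the stochastic excitation only has to model what the adaptive codebook could not.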

The RPE sequence is computed once before the closed loop codebook search is started. If the n-th nonzero sample of the ideal RPE has maximum magnitude, the codebook part n is searched closed-loop for the best excitation vector in blocks 4a via 14. Finally, the excitation of the synthesis filter is computed from the stochastic and adaptive excitations and the respective gains g1, g2, and the adaptive codebook 3 is updated.
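The closed-loop search restricted to the preselected codebook part can be sketched as below. The function name `search_codebook_part` and its return format are assumptions, the gain is an unquantized least-squares gain rather than an entry of the gain codebook 16, and filtering is done by direct convolution with the impulse response h (at least as long as the target).

```python
def search_codebook_part(part, target, h, first_pos, sign):
    """Return (index, gain, error) of the best pulse vector in one codebook part."""
    n = len(target)
    best = None
    for index, pulse_vector in enumerate(part):
        spacing = n // len(pulse_vector)
        # Excitation generator: regular grid, overall sign applied to all pulses.
        c = [0.0] * n
        for k, amplitude in enumerate(pulse_vector):
            c[first_pos + k * spacing] = sign * amplitude
        # Synthesis filtering: c' = c * H, i.e. convolution with h.
        y = [sum(c[i] * h[j - i] for i in range(j + 1)) for j in range(n)]
        energy = sum(v * v for v in y)
        gain = sum(t * v for t, v in zip(target, y)) / energy if energy > 0.0 else 0.0
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best
```

Only the vectors of the one part selected by the maximum pulse position are filtered and compared, which is where the saving relative to a search over all parts comes from.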

FIGS. 6(a) and 6(b) show in block diagrams essential parts of the decoder. As in most analysis-by-synthesis coders, the operations to be performed (except post processing) are quite similar to those already performed in the corresponding encoder stages. Accordingly, a detailed description of the schemes of FIGS. 6(a) and 6(b) is omitted. To decode the transmitted parameters, just a few table look-ups are required to obtain the filter coefficients, the loudness and the excitation of the synthesis filter.

As shown in FIG. 6(b), the price to pay for the reduced bit rate needed to transmit the speech signal is that the signal cannot be reconstructed completely. Noisy components (coding noise) are introduced by the speech encoder and can be heard (more or less). To avoid annoying effects, post filtering is employed. The aim is to suppress the coding noise while retaining the naturalness of the speech signal. In this codec, a post filter 70 including long term and short term filtering is employed to increase the perceptual speech quality.

Summarizing the above, instead of applying the search for the stochastic excitation to all pulse vector candidates, a hybrid search technique is used. After computation of the ideal RPE sequence, first the position of the first nonzero pulse and the position of the maximum pulse are determined in the "ideal" pulse vector. Second, the codebook search is performed. Since there is one pulse vector codebook for each position of the maximum pulse, only the pulse vector codebook belonging to this position has to be searched for the "best" codevector. This technique according to the invention drastically reduces the computational requirements for finding the "best" stochastic excitation compared with applying the codebook search to all pulse vector codebooks.

What is claimed is:
 1. A method of synthesizing a block of a speech signal in a CELP-type coder, the method comprising the steps of: applying an excitation vector to a synthesizer filter of the coder, said excitation vector consisting of two gain normalized components derived from an adaptive codebook and from a stochastic codebook, for limiting the computational effort of the stochastic codebook components search, computing an ideal Regular Pulse Excitation (RPE) sequence followed by determining four parameters, namely the position of the first nonzero pulse of the ideal RPE excitation sequence, the position of the maximum pulse within said RPE excitation sequence, the overall sign of the regular pulse excitation sequence defined as the respective sign of said maximum pulse, and the position of the corresponding part of the pulse codebook, as the position of the maximum pulse, wherein the method further comprises a step of transmitting said four parameters to a speech decoder.
 2. The method according to claim 1, wherein, in order to remove influence of the loudness of a speech signal from entries of the pulse codebook, there is a further step of normalizing the RPE sequences which are used for code-book-training.
 3. The method according to claim 2, further comprising a step of performing normalization of said gain components in two steps, namely a first step in which the RPE sequence is modified such that the maximum pulse has positive value and in a second step in which the sequence obtained after the first step is divided by the energy of a target vector to which said RPE sequence belongs.
 4. The method according to claim 1, wherein, in said step of computing the Regular Pulse Excitation sequence, the Regular Pulse Excitation sequence is computed from a target vector derived from a weighted speech sample signal and the pulse response of the synthesizer filter.